Fix process crash when testing Stderr #1449
ericstj wants to merge 33 commits into modelcontextprotocol:main
Conversation
This may not be the root cause of the problem. We've now got dumps on both Windows and Linux. I'll have a look and see if the dumps reveal a better root cause.
Crashes fixed, but we have tests hanging. Could be due to the …
Yeah, WaitForExit is now hanging... I'm not convinced this is the right change. I'm wondering more "why aren't processes being cleaned up?" and whether we have a hang or a test bug somewhere. Also, the fragility here, where missing something means a crash, seems busted. I think we should have a design that's resilient to not cleaning up external state.
Found leaked StdioServerTransport objects in the dumps; these kept handles open to StdIn and were causing threadpool starvation -- at least that's the theory.
StdioServerTransport does not reliably close on linux, which causes a threadpool thread to be blocked on read. This can lead to threadpool starvation and application hangs. Since the tests in question do not require stdio transport, switch to using StreamServerTransport with null streams to avoid this issue.
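The switch described in that commit can be sketched as follows. This is a minimal sketch: `AddMcpServer`/`WithStreamServerTransport` follow the builder API named in the PR summary, and the host setup around them is illustrative.

```csharp
// Sketch: tests that don't exercise stdio can serve over null streams instead,
// so no thread pool thread sits blocked on a stdin read.
var builder = Host.CreateApplicationBuilder();
builder.Services
    .AddMcpServer()
    // .WithStdioServerTransport()  // blocks a thread pool thread reading stdin
    .WithStreamServerTransport(Stream.Null, Stream.Null); // immediate EOF; writes discarded
await builder.Build().RunAsync();
```

`Stream.Null` reads return EOF immediately and writes are discarded, so the server never parks a thread on real I/O.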
When CleanupAsync is entered from ReadMessagesAsync (stdout EOF), the cancellation token is _shutdownCts.Token which hasn't been canceled yet (base.CleanupAsync runs later). If stderr EOF hasn't been delivered by the threadpool yet, WaitForExitAsync hangs indefinitely. Use a linked CTS with ShutdownTimeout to bound the wait. The process is already dead; we're just draining stderr pipe buffers. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
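The bounded wait described in that commit might look roughly like this. `ShutdownTimeout` and `_process` are the names assumed from the commit message; this is a sketch, not the actual diff.

```csharp
// Sketch: link to the caller's token but also bound the wait with ShutdownTimeout.
// The process is already dead here; we're only draining stderr pipe buffers so the
// remaining ErrorDataReceived callbacks can fire.
using var linkedCts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
linkedCts.CancelAfter(ShutdownTimeout);
try
{
    await _process.WaitForExitAsync(linkedCts.Token).ConfigureAwait(false);
}
catch (OperationCanceledException)
{
    // Timed out (or was canceled) while draining stderr; continue cleanup regardless.
}
```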
xUnit v3 test host hangs intermittently on .NET 10 when RuntimeAsync is enabled (the default). The hang occurs after all tests have passed, with the test host process stalling indefinitely. Setting DOTNET_RuntimeAsync=0 reliably prevents the hang. This is a temporary workaround pending a fix in the .NET runtime. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
TEMPORARY diagnostic instrumentation to capture what happens during the intermittent xUnit v3 RunTest hang where the finished TCS is never signaled. Changes:
- Instrumented xunit.v3.core.dll with trace points in TestRunner.RunTest: before/after EC.Run, runTest entry, pre-try completion, test invocation, and finished signaling. Also moves pre-try code inside the try block as a structural fix.
- DiagnosticExceptionTracing.cs module initializer hooking UnhandledException and UnobservedTaskException for additional context.
- Both test csproj files copy instrumented DLLs post-build via a ReplaceXunitWithInstrumented target.
- CI workflow collects xunit-runtest-diag.log as a test artifact.
All instrumentation writes to xunit-runtest-diag.log and stderr. Remove once the root cause is identified.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The async chain from ConnectAsync through initializeTask/processingTask can become GC-unreachable when the server process exits before responding. Task.Delay creates a System.Threading.Timer that is directly rooted in the runtime's timer queue, providing a simple GC root chain: Timer Queue -> Timer -> Task.Delay -> WhenAny -> ConnectAsync. This ensures ConnectAsync always resumes within the initialization timeout, even if intermediate state machines are collected.

Use a dedicated CTS for the delay (not the caller's cancellationToken) so external cancellation flows through initializeTask via initializationCts, preserving the expected OperationCanceledException propagation path. Only call FailPendingRequests when processingTask wins (server exited). When delayTask wins, let the CancelAfter timer cancel initializationCts naturally, producing TimeoutException through the existing OCE catch block.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
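The timer-rooting idea reads roughly like this sketch. Identifiers such as `processingTask`, `initializationCts`, and `FailPendingRequests` come from the commit message; the surrounding control flow is simplified.

```csharp
// Sketch: Task.Delay's Timer lives in the runtime timer queue, a real GC root, so
// this WhenAny chain stays reachable even if other async state machines are collected.
using var delayCts = new CancellationTokenSource(); // dedicated CTS, not the caller's token
Task delayTask = Task.Delay(initializationTimeout, delayCts.Token);

Task completed = await Task.WhenAny(processingTask, delayTask).ConfigureAwait(false);
if (completed == processingTask)
{
    // The server process exited before responding: fail pending requests now.
    FailPendingRequests();
}
// If delayTask won, the CancelAfter timer cancels initializationCts on its own,
// and the existing OperationCanceledException catch surfaces a TimeoutException.

delayCts.Cancel(); // tear down the timer (and its GC root) once it's no longer needed
```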
xUnit v3's TestRunner.RunTest wraps test execution in an async void callback (required by ExecutionContext.Run's ContextCallback signature). When the test's async continuation chain forms a cycle with no external GC root (no I/O, no timers), the entire object graph is collectible. In Release mode on Ubuntu, the JIT's aggressive optimizations make this reproducible — the async void state machine gets collected, the test never completes, and the test host hangs. The fix extracts the async void body into an async Task method and holds a reference to the returned Task, keeping the entire async chain GC-rooted for the lifetime of RunTest. Also reverts prior ConnectAsync/McpSessionHandler workarounds that were attempting to address this from the product side. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace async void with plain void for the ExecutionContext.Run callback. The test body runs in an async Task method (runTestCore) whose returned Task is pinned via GCHandle.Alloc — a root in the GC handle table that is independent of the managed object graph. This is stronger than GC.KeepAlive and avoids async void entirely. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace async void with async Task for the test body, await the Task directly. Eliminates GCHandle, finished TCS, and async void entirely. This is cleaner xUnit code but does not fix the intermittent test hang, which has a separate root cause in the product code. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
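The end state of that xUnit change is roughly the following shape. `capturedContext` and `RunTestCoreAsync` are illustrative names, not the actual xunit.v3 identifiers.

```csharp
// Sketch: ExecutionContext.Run still takes a void ContextCallback, but the test
// body is an async Task method. We capture its Task inside the callback and await
// it afterwards, so the async chain is rooted by the awaiting caller and async
// void (with its fire-and-forget, unobservable-failure semantics) is gone.
Task? testTask = null;
ExecutionContext.Run(capturedContext, _ =>
{
    testTask = RunTestCoreAsync(); // starts synchronously on the captured context
}, null);
await testTask!.ConfigureAwait(false); // completion and exceptions are observed here
```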
StreamClientSessionTransport.DisposeAsync could hang indefinitely when awaiting Completion because the channel writer was never completed.

Root cause: ReadMessagesAsync and DisposeAsync race to call CleanupAsync. The _cleanedUp Interlocked guard ensures only one runs the full body (which calls SetDisconnected to complete the channel). The loser calls CancelShutdown() and returns immediately, without waiting for the winner to finish. If DisposeAsync then does 'await Completion', it blocks on a channel that the still-running cleanup hasn't completed yet.

Fix: call SetDisconnected() after CleanupAsync in DisposeAsync as a safety net. SetDisconnected is idempotent (uses Interlocked.Exchange), so a duplicate call is a no-op.

Also removes the instrumented xUnit DLLs and MSBuild target that were added to test a disproven GC-collection hypothesis.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
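A minimal sketch of the idempotent safety net, assuming a channel-backed transport; the field and channel names here are illustrative, not the SDK's actual ones.

```csharp
using System.Threading.Channels;

// Sketch: the Interlocked guard makes SetDisconnected idempotent, so the
// safety-net call in DisposeAsync is a harmless no-op if cleanup already ran it.
sealed class TransportSketch : IAsyncDisposable
{
    private readonly Channel<string> _messageChannel = Channel.CreateUnbounded<string>();
    private int _disconnected; // 0 = connected, 1 = disconnected

    private void SetDisconnected(Exception? error = null)
    {
        if (Interlocked.Exchange(ref _disconnected, 1) == 1)
        {
            return; // already disconnected; duplicate call is a no-op
        }

        // Completing the writer unblocks anyone awaiting the channel's Completion.
        _messageChannel.Writer.TryComplete(error);
    }

    public async ValueTask DisposeAsync()
    {
        await CleanupAsync().ConfigureAwait(false);
        SetDisconnected(); // safety net: covers losing the CleanupAsync race
    }

    private ValueTask CleanupAsync() => default; // placeholder for the real cleanup
}
```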
First fully green build https://github.com/modelcontextprotocol/csharp-sdk/actions/runs/23910228803/job/69729961887?pr=1449, now to rerun.
https://github.com/modelcontextprotocol/csharp-sdk/actions/runs/23910228803/job/69732069649?pr=1449 Rerunning.
https://github.com/modelcontextprotocol/csharp-sdk/actions/runs/23910228803/job/69734695843?pr=1449 Another build, another flaky test. This time it's ReadEventsAsync_RespectsModeSwitchFromStreamingToPolling -- seems to be another synchronization issue -- not a hang. Opened a PR for this. Rerunning.
Woohoo -- another green build https://github.com/modelcontextprotocol/csharp-sdk/actions/runs/23910228803/job/69757289201?pr=1449
And another green build -- https://github.com/modelcontextprotocol/csharp-sdk/actions/runs/23910228803/job/69763953157?pr=1449 |
GetUnexpectedExitExceptionAsync used a linked CancellationTokenSource that included the caller's token (_shutdownCts.Token). When ReadMessagesAsync and DisposeAsync race to call CleanupAsync, the loser calls CancelShutdown() which cancels _shutdownCts. This prematurely aborted WaitForExitAsync in the winner's cleanup path, preventing stderr pipe buffers from being fully drained. The ErrorDataReceived callback would never fire for lines still in the pipe, causing CreateAsync_ValidProcessInvalidServer_StdErrCallbackInvoked to fail intermittently. Fix: use a standalone timeout CTS instead of linking to the caller's token. The process is already dead at this point—we only need a timeout to bound the pipe drain, not external cancellation. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
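The revised wait might look like this sketch; contrast it with the earlier linked-token approach from the CleanupAsync commit. `ShutdownTimeout` and `_process` are assumed names.

```csharp
// Sketch: a standalone timeout CTS, deliberately NOT linked to _shutdownCts, so a
// racing CancelShutdown() cannot abort the stderr drain. The process is already
// dead; the timeout only bounds how long we wait for the pipe to empty.
using var timeoutCts = new CancellationTokenSource(ShutdownTimeout);
try
{
    await _process.WaitForExitAsync(timeoutCts.Token).ConfigureAwait(false);
}
catch (OperationCanceledException)
{
    // The drain took too long; proceed with cleanup anyway.
}
```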
Ok, this should be good to go. Updated the summary at the top of the PR. @stephentoub @halter73 -- do your worst.
https://github.com/modelcontextprotocol/csharp-sdk/actions/runs/23922574167/job/69772194173?pr=1449 Rerunning to prove consistently green.
Extract WaitForProcessExitAsync to ensure ErrorDataReceived events are fully drained before the handler is detached. Fixes a race where HasExited returns false while the process is exiting (stdout closed but not yet reaped), causing GetUnexpectedExitExceptionAsync to skip the drain and the handler to be detached before callbacks arrive. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
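The extracted helper plausibly has the following shape. This is a sketch: the handler-detach placement is the point, and the parameter names are illustrative.

```csharp
// Sketch: wait for process exit before detaching the ErrorDataReceived handler,
// so callbacks for lines still sitting in the stderr pipe are not dropped, even
// when HasExited transiently reports false while the process is being reaped.
private static async Task WaitForProcessExitAsync(
    Process process, DataReceivedEventHandler onErrorLine, CancellationToken token)
{
    try
    {
        await process.WaitForExitAsync(token).ConfigureAwait(false);
    }
    finally
    {
        process.ErrorDataReceived -= onErrorLine; // detach only after the drain
    }
}
```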
https://github.com/modelcontextprotocol/csharp-sdk/actions/runs/23928222006/job/69798710432?pr=1449 ✅ 6 clean runs. I think this is ready. |

Fix #1448
Summary of fixes
Product changes
- `StreamClientSessionTransport.DisposeAsync` — safety-net `SetDisconnected()` after `CleanupAsync` prevents an indefinite hang when `ReadMessagesAsync` and `DisposeAsync` race
- `StdioClientSessionTransport.GetUnexpectedExitExceptionAsync` — standalone timeout CTS instead of a linked token prevents `CancelShutdown()` from prematurely aborting the stderr pipe drain
- `StdioClientTransport` — try/catch around the user's `StandardErrorLines` callback prevents crashing the host; named handler + detach enables proper cleanup
Test fixes
- Replace `WithStdioServerTransport()` with `WithStreamServerTransport(Stream.Null, Stream.Null)` — `StdioServerTransport` blocks a thread pool thread on stdin
- New test: `CreateAsync_StdErrCallbackThrows_DoesNotCrashProcess`
@stephentoub